Data wrangling

Getting data into the right shape for plotting

Elizabeth King
Kevin Middleton

Why does data shape matter?

Wide vs. long data

Species  Island     Sex     Year  Bill_length  Bill_depth
Adelie   Torgersen  Male    2007         39.1        18.7
Adelie   Torgersen  Female  2007         39.5        17.4
Adelie   Torgersen  Female  2007         40.3        18.0
Adelie   Torgersen  Female  2007         36.7        19.3
Adelie   Torgersen  Male    2007         39.3        20.6
Adelie   Torgersen  Female  2007         38.9        17.8
Adelie   Torgersen  Male    2007         39.2        19.6
Adelie   Torgersen  Female  2007         41.1        17.6
Adelie   Torgersen  Male    2007         38.6        21.2
Adelie   Torgersen  Male    2007         34.6        21.1
Adelie   Torgersen  Female  2007         36.6        17.8

Bar charts for counts

ggplot(penguins2, aes(x = Species)) +
  geom_bar() +
  labs(y = "Count")

Grouped bar charts

ggplot(penguins2, aes(x = Species, fill = Sex)) +
  geom_bar(position = "dodge") +
  scale_fill_manual(values = c("navy", "darkred")) +
  labs(y = "Count")

We will talk more about position = "dodge" next week.

Grouped bar charts

ggplot(penguins2, aes(x = Species, fill = Island)) +
  geom_bar(position = "dodge") +
  labs(y = "Count") +
  facet_grid(Sex ~ .) +
  scale_fill_paletteer_d("beyonce::X56")

Stacked bar charts

ggplot(penguins2, aes(x = Species, fill = Island)) +
  geom_bar(position = "fill") +
  facet_grid(Sex ~ .) +
  scale_fill_paletteer_d("beyonce::X56") +
  scale_y_continuous(name = "Percent", labels = scales::percent)
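
To see what position = "fill" computes, the same proportions can be worked out by hand. This is a minimal sketch on a small made-up tibble (toy and toy_props are hypothetical names, not the penguins2 data): count the rows per group, then divide by the total within each bar.

```r
library(dplyr)
library(tibble)

# Hypothetical toy data standing in for penguins2
toy <- tibble(
  Species = c("Adelie", "Adelie", "Adelie", "Adelie", "Gentoo", "Gentoo"),
  Island  = c("Biscoe", "Dream", "Dream", "Dream", "Biscoe", "Biscoe")
)

toy_props <- toy |>
  count(Species, Island) |>           # rows per Species x Island
  group_by(Species) |>                # one bar per Species
  mutate(Proportion = n / sum(n)) |>  # share of each Island within its bar
  ungroup()

toy_props
```

Within each Species the proportions sum to 1, which is why every bar in a filled chart has the same height.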

Aggregating to get counts

penguin_counts <- penguins2 |> 
  group_by(Species, Island, Sex) |> 
  count()


# A tibble: 10 × 4
# Groups:   Species, Island, Sex [10]
   Species   Island    Sex        n
   <fct>     <fct>     <chr>  <int>
 1 Adelie    Biscoe    Female    22
 2 Adelie    Biscoe    Male      22
 3 Adelie    Dream     Female    27
 4 Adelie    Dream     Male      28
 5 Adelie    Torgersen Female    24
 6 Adelie    Torgersen Male      23
 7 Chinstrap Dream     Female    34
 8 Chinstrap Dream     Male      34
 9 Gentoo    Biscoe    Female    58
10 Gentoo    Biscoe    Male      61
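
The "# Groups:" line in the output shows that the result is still grouped. As a minimal sketch on made-up data (toy, counts_grouped, and counts_flat are hypothetical names), passing the columns straight to count() should give the same numbers in an ungrouped tibble:

```r
library(dplyr)
library(tibble)

# Hypothetical toy data standing in for penguins2
toy <- tibble(
  Species = c("Adelie", "Adelie", "Gentoo", "Gentoo", "Gentoo"),
  Sex     = c("Female", "Male", "Female", "Female", "Male")
)

# Two steps: the result stays grouped by Species and Sex
counts_grouped <- toy |>
  group_by(Species, Sex) |>
  count()

# One step: same counts, but the result is not grouped
counts_flat <- toy |>
  count(Species, Sex)

is_grouped_df(counts_grouped)
is_grouped_df(counts_flat)
```

Either form works for plotting. The ungrouped version avoids surprises if later steps such as mutate() or summarize() are added.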

Aggregating to get counts 2

penguin_counts_by_year <- penguins2 |> 
  group_by(Species, Island, Sex, Year) |> 
  count()


# A tibble: 30 × 5
# Groups:   Species, Island, Sex, Year [30]
   Species Island Sex     Year     n
   <fct>   <fct>  <chr>  <int> <int>
 1 Adelie  Biscoe Female  2007     5
 2 Adelie  Biscoe Female  2008     9
 3 Adelie  Biscoe Female  2009     8
 4 Adelie  Biscoe Male    2007     5
 5 Adelie  Biscoe Male    2008     9
 6 Adelie  Biscoe Male    2009     8
 7 Adelie  Dream  Female  2007     9
 8 Adelie  Dream  Female  2008     8
 9 Adelie  Dream  Female  2009    10
10 Adelie  Dream  Male    2007    10
# … with 20 more rows

Working with aggregated data

Galápagos plant diversity 1

Island         Species  Endemics     Area  Elevation
Baltra              58        23    25.09        100
Bartolome           31        21     1.24        109
Caldwell             3         3     0.21        114
Champion            25         9     0.10         46
Coamano              2         1     0.05          5
Daphne Major        18        11     0.34        120
Darwin              10         7     2.33        168
Eden Rock            8         4     0.03         90
Enderby              2         2     0.18        112
Espanola            97        26    58.27        198
Fernandina          93        35   634.49       1494
Gardner a           58        17     0.57         49
Gardner b            5         4     0.78        227
Genovesa            40        19    17.35         76
Isabela            347        89  4669.32       1707
Marchena            51        23   129.49        343
Onslow               2         2     0.01         25
Pinta              104        37    59.56        777
Pinzon             108        33    17.95        458
Las Plazas          12         9     0.23         25
Rabida              70        30     4.89        367
San Cristobal      280        65   551.62        716
San Salvador       237        81   572.33        906
Santa Cruz         444        95   903.82        864
Santa Fe            62        28    24.08        259
Santa Maria        285        73   170.92        640
Seymour             44        16     1.84         30
Tortuga             16         8     1.24        186
Wolf                21        12     2.85        253

Bar charts for aggregated data

geom_bar() counts rows by default. stat = "identity" tells ggplot to use the y value as is.

ggplot(gala2, aes(x = Island, y = Species)) +
  geom_bar(stat = "identity") +
  theme(axis.text.x = element_text(angle = 90,
                                   hjust = 1, vjust = 0.5))
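
ggplot2 also provides geom_col(), which plots the y values as given without needing stat = "identity". A minimal sketch, using the first three islands from the table above (the data frame name islands is made up for this example):

```r
library(ggplot2)

# First three islands from the table above
islands <- data.frame(
  Island  = c("Baltra", "Bartolome", "Caldwell"),
  Species = c(58, 31, 3)
)

# Bar heights taken from the Species column via stat = "identity"
p_bar <- ggplot(islands, aes(x = Island, y = Species)) +
  geom_bar(stat = "identity")

# Same bars with geom_col()
p_col <- ggplot(islands, aes(x = Island, y = Species)) +
  geom_col()

# Compare the bar heights that each plot will draw
layer_data(p_bar)$y
layer_data(p_col)$y
```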

Arranging Species counts

gala2 |> 
  arrange(desc(Species)) |> 
  ggplot(aes(x = Island, y = Species)) +
  geom_bar(stat = "identity") +
  theme(axis.text.x = element_text(angle = 90,
                                   hjust = 1,
                                   vjust = 0.5))

Explicitly defining factors

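ggplot treats chr columns as factors with levels in alphabetical order, so arrange() alone does not reorder the bars. fct_inorder() sets the levels in order of appearance.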

gala2 |> 
  arrange(desc(Species)) |> 
  mutate(Island = fct_inorder(Island)) |> 
  ggplot(aes(x = Island, y = Species)) +
  geom_bar(stat = "identity") +
  theme(axis.text.x = element_text(angle = 90,
                                   hjust = 1, vjust = 0.5))
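
The difference between the two orderings can be checked without plotting. A minimal sketch, assuming the three islands with the highest Species counts have already been arranged in descending order (the vector name island is made up for this example):

```r
library(forcats)

# Top three islands by Species count, already in arranged order
island <- c("Santa Cruz", "Isabela", "Santa Maria")

# factor() sorts the levels alphabetically
alphabetical <- factor(island)

# fct_inorder() keeps the order of first appearance
in_order <- fct_inorder(island)

levels(alphabetical)
levels(in_order)
```

The levels of in_order match the arranged rows, so the bars follow the sorted counts.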

Rotating the axes

gala2 |> 
  arrange(desc(Species)) |> 
  mutate(Island = fct_inorder(Island)) |> 
  ggplot(aes(x = Island, y = Species)) +
  geom_bar(stat = "identity") +
  theme(axis.text.x = element_text(angle = 90,
                                   hjust = 1, vjust = 0.5)) +
  coord_flip()

Reversing the x axis

gala2 |> 
  arrange(desc(Species)) |> 
  mutate(Island = fct_inorder(Island)) |> 
  ggplot(aes(x = Island, y = Species)) +
  geom_bar(stat = "identity") +
  theme(axis.text.x = element_text(angle = 90,
                                   hjust = 1, vjust = 0.5)) +
  coord_flip() +
  scale_x_discrete(limits = rev)

Histograms as a special bar chart

ggplot(penguins |> 
         drop_na(bill_length_mm),
       aes(bill_length_mm, color = species)) +
  geom_histogram(bins = 30, fill = NA)
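
One thing to watch: geom_histogram() stacks groups by default (position = "stack"), so the outlines above show stacked heights rather than each species' own counts. A minimal sketch of an overlapping version, using simulated values (sim is made-up data, not the penguins measurements):

```r
library(ggplot2)

set.seed(1)
# Simulated stand-in for bill lengths of two species (not real data)
sim <- data.frame(
  bill_length_mm = c(rnorm(100, mean = 39, sd = 3),
                     rnorm(100, mean = 47, sd = 3)),
  species = rep(c("A", "B"), each = 100)
)

# position = "identity" starts every bar at zero;
# alpha makes the overlapping bars visible through each other
p_identity <- ggplot(sim, aes(bill_length_mm, fill = species)) +
  geom_histogram(bins = 30, position = "identity", alpha = 0.5)

p_identity
```

Even with transparency, overlapping histograms get hard to read with three or more groups.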

Density plots handle overlap better

ggplot(penguins |> 
         drop_na(bill_length_mm),
       aes(bill_length_mm, color = species)) +
  geom_density()

Pivoting data

Wide to long:

  • Multiple columns into rows

Long to wide:

  • Multiple rows into columns
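
Both directions can be sketched on a small made-up table (long, wide, and long_again are hypothetical names, and the counts are invented for illustration). pivot_wider() spreads rows into columns and pivot_longer() gathers them back:

```r
library(tidyr)
library(tibble)

# Hypothetical long table: one row per island and year
long <- tibble(
  Island = c("Biscoe", "Biscoe", "Dream", "Dream"),
  Year   = c(2007, 2008, 2007, 2008),
  n      = c(10, 18, 19, 16)
)

# Long to wide: each Year becomes its own column
wide <- long |>
  pivot_wider(names_from = Year, values_from = n)

# Wide to long: gather the year columns back into rows
# (Year comes back as chr because it was stored in column names)
long_again <- wide |>
  pivot_longer(cols = -Island, names_to = "Year", values_to = "n")

wide
long_again
```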

Multiple columns to rows

penguins3 <- penguins |> 
  select(species, starts_with("bill")) |> 
  drop_na() |> 
  rename(Species = species,
         `Bill length` = bill_length_mm,
         `Bill depth` = bill_depth_mm)

penguins3
# A tibble: 342 × 3
   Species `Bill length` `Bill depth`
   <fct>           <dbl>        <dbl>
 1 Adelie           39.1         18.7
 2 Adelie           39.5         17.4
 3 Adelie           40.3         18  
 4 Adelie           36.7         19.3
 5 Adelie           39.3         20.6
 6 Adelie           38.9         17.8
 7 Adelie           39.2         19.6
 8 Adelie           34.1         18.1
 9 Adelie           42           20.2
10 Adelie           37.8         17.1
# … with 332 more rows

Multiple columns to rows

penguins_long <- penguins3 |> 
  pivot_longer(cols = -Species,
               names_to = "Type",
               values_to = "Measure")

penguins_long
# A tibble: 684 × 3
   Species Type        Measure
   <fct>   <chr>         <dbl>
 1 Adelie  Bill length    39.1
 2 Adelie  Bill depth     18.7
 3 Adelie  Bill length    39.5
 4 Adelie  Bill depth     17.4
 5 Adelie  Bill length    40.3
 6 Adelie  Bill depth     18  
 7 Adelie  Bill length    36.7
 8 Adelie  Bill depth     19.3
 9 Adelie  Bill length    39.3
10 Adelie  Bill depth     20.6
# … with 674 more rows

Plotting long data

ggplot(penguins_long,
       aes(x = Type, y = Measure, color = Species)) +
  geom_point(position = position_jitter(width = 0.1)) +
  facet_wrap("Species") +
  scale_color_paletteer_d("beyonce::X56") +
  theme(axis.title.x = element_blank()) +
  labs(y = "Measurement (mm)")

Splitting or Joining Columns

# A tibble: 20 × 5
    Year Month   Day ID    Count
   <dbl> <dbl> <dbl> <chr> <int>
 1  2022     7     1 A-1      16
 2  2022     7     2 A-1      14
 3  2022     7     3 A-1      20
 4  2022     7     4 A-1      15
 5  2022     7     5 A-1      13
 6  2022     7     1 B-1      10
 7  2022     7     2 B-1       4
 8  2022     7     3 B-1       5
 9  2022     7     4 B-1       9
10  2022     7     5 B-1       7
11  2022     7    11 A-2      17
12  2022     7    12 A-2      19
13  2022     7    13 A-2      11
14  2022     7    14 A-2      14
15  2022     7    15 A-2      16
16  2022     7    11 B-2       3
17  2022     7    12 B-2       3
18  2022     7    13 B-2       3
19  2022     7    14 B-2       6
20  2022     7    15 B-2       4

Joining with unite()

library(lubridate)
M <- M |> 
  unite(col = "Date", c(Year, Month, Day), sep = "-") |> 
  mutate(Date = ymd(Date))
# A tibble: 20 × 3
   Date       ID    Count
   <date>     <chr> <int>
 1 2022-07-01 A-1      16
 2 2022-07-02 A-1      14
 3 2022-07-03 A-1      20
 4 2022-07-04 A-1      15
 5 2022-07-05 A-1      13
 6 2022-07-01 B-1      10
 7 2022-07-02 B-1       4
 8 2022-07-03 B-1       5
 9 2022-07-04 B-1       9
10 2022-07-05 B-1       7
11 2022-07-11 A-2      17
12 2022-07-12 A-2      19
13 2022-07-13 A-2      11
14 2022-07-14 A-2      14
15 2022-07-15 A-2      16
16 2022-07-11 B-2       3
17 2022-07-12 B-2       3
18 2022-07-13 B-2       3
19 2022-07-14 B-2       6
20 2022-07-15 B-2       4
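
When the date parts are already numeric, lubridate's make_date() is an alternative that skips the intermediate string. A minimal sketch, using the first three rows of the original table (M_raw and M_dates are names made up for this example):

```r
library(dplyr)
library(tibble)
library(lubridate)

# First three rows of the original table
M_raw <- tibble(
  Year  = c(2022, 2022, 2022),
  Month = c(7, 7, 7),
  Day   = c(1, 2, 3),
  ID    = c("A-1", "A-1", "A-1"),
  Count = c(16L, 14L, 20L)
)

# Build the Date column directly from the numeric parts,
# then keep only the columns needed for plotting
M_dates <- M_raw |>
  mutate(Date = make_date(Year, Month, Day)) |>
  select(Date, ID, Count)

M_dates
```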

Splitting with separate()

M <- M |> 
  separate(ID, sep = "-", into = c("Replicate", "Timepoint"))
# A tibble: 20 × 4
   Date       Replicate Timepoint Count
   <date>     <chr>     <chr>     <int>
 1 2022-07-01 A         1            16
 2 2022-07-02 A         1            14
 3 2022-07-03 A         1            20
 4 2022-07-04 A         1            15
 5 2022-07-05 A         1            13
 6 2022-07-01 B         1            10
 7 2022-07-02 B         1             4
 8 2022-07-03 B         1             5
 9 2022-07-04 B         1             9
10 2022-07-05 B         1             7
11 2022-07-11 A         2            17
12 2022-07-12 A         2            19
13 2022-07-13 A         2            11
14 2022-07-14 A         2            14
15 2022-07-15 A         2            16
16 2022-07-11 B         2             3
17 2022-07-12 B         2             3
18 2022-07-13 B         2             3
19 2022-07-14 B         2             6
20 2022-07-15 B         2             4
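
Note that Timepoint comes out as chr. If a numeric column is wanted, separate() accepts convert = TRUE. A minimal sketch on the four distinct ID values (ids and ids_split are names made up for this example):

```r
library(tidyr)
library(tibble)

# The four distinct ID values from the table above
ids <- tibble(ID = c("A-1", "B-1", "A-2", "B-2"))

ids_split <- ids |>
  separate(ID, sep = "-", into = c("Replicate", "Timepoint"),
           convert = TRUE)  # run type conversion on the new columns

ids_split
```

With convert = TRUE, Timepoint becomes an integer column while Replicate stays chr.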

Plotting dates

ggplot(M, aes(x = Date, y = Count, color = Replicate)) +
  geom_point() +
  geom_path() +
  scale_x_date(date_labels = "%m/%d", date_breaks = "1 day")

Plotting dates

Plot shapes: https://r-graphics.org/recipe-scatter-shapes

ggplot(M, aes(x = Date, y = Count, shape = Replicate)) +
  geom_point(size = 4) +
  scale_shape_manual(values = c(0, 15)) +
  scale_x_date(date_labels = "%m/%d", date_breaks = "1 day")
